library(ggplot2)
ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_point()
#### 4.2.1 Mapping aesthetics to data
A scatterplot represents each observation as a point, positioned according to the value of two variables. As well as a horizontal and vertical position, each point also has a size, a colour and a shape.
These attributes are called aesthetics, and are the properties that can be perceived on the graphic.
The scatterplot uses points, but were we instead to draw lines we would get a line plot. If we used bars, we’d get a bar plot.
ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_line() +
  theme(legend.position = "none")
ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_bar(stat = "identity", position = "identity", fill = NA) +
  theme(legend.position = "none")
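Points, lines and bars are all examples of geometric objects, or geoms. Geoms determine the “type” of the plot. Plots that use a single geom are often given a special name. Plots that combine several geoms have no special name and must be described by hand; the following overlays a per-group regression line on a scatterplot:
ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm")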
Unify units: scaling
The drawing system that ggplot2 uses, grid, takes care of mapping from the range of the data to [0, 1] for us. A final step determines how the two positions (x and y) are combined to form the final location on the plot. This is done by the coordinate system, or coord.
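As a rough illustration of what this rescaling amounts to, the sketch below linearly maps displ and hwy onto [0, 1] by hand. The helper rescale01() is a hypothetical name introduced here, not part of ggplot2 or grid; grid does the equivalent work internally, so you would not normally write this yourself.

```r
library(ggplot2)

# Hypothetical helper: linearly rescale a numeric vector onto [0, 1].
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

scaled <- data.frame(
  x = rescale01(mpg$displ),
  y = rescale01(mpg$hwy)
)

head(scaled, 3)
range(scaled$x)  # the smallest value maps to 0, the largest to 1
```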
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~year)
## `geom_smooth()` using method = 'loess'
Each facet panel in each layer has its own dataset.
The smooth layer is different to the point layer because it doesn’t display the raw data, but instead displays a statistical transformation of the data.
The data is passed to a statistical transformation, or stat, which manipulates the data in some useful way.
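One way to see what a stat produces is to inspect the data that a layer hands to its geom. This is a minimal sketch, assuming ggplot2's layer_data() helper and an explicitly requested loess smoother; the names p_smooth and smooth_data are introduced here for illustration only.

```r
library(ggplot2)

p_smooth <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

# Layer 2 is the smooth. Its data holds fitted values and a
# confidence band computed by the stat, not the raw observations.
smooth_data <- layer_data(p_smooth, 2)
head(smooth_data[c("x", "y", "ymin", "ymax")])
```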
Scale transformation occurs before statistical transformation so that statistics are computed on the scale-transformed data.
Each scale is trained on every dataset from all the layers and facets.
The scales map the data values into aesthetic values.
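The ordering of scale transformation and statistical transformation can be checked on a built plot. The sketch below is one possible illustration, assuming layer_data() and a log10 x scale; p_log and point_data are names made up for this example.

```r
library(ggplot2)

p_log <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_x_log10()

# The x values seen by the layers are already on the log10 scale,
# so the linear model is fitted to log10(displ), not to displ.
point_data <- layer_data(p_log, 1)
all.equal(sort(point_data$x), sort(log10(mpg$displ)))
```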
Layers are responsible for creating the objects that we perceive on the plot. A layer is composed of:
1: data and aesthetic mapping
2: a statistical transformation (stat)
3: a geometric object (geom)
4: a position adjustment
A coordinate system, or coord for short, maps the position of objects onto the plane of the plot.
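To make the role of the coord concrete, the following sketch draws the same data and geom under two coordinate systems. It assumes coord_polar() with its default settings; base is a name introduced here.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

base                  # default Cartesian coordinates
base + coord_polar()  # same layer; x (displ) becomes the angle, y (hwy) the radius
```

Only the coord changes between the two plots; the data, mappings, stat and geom are identical.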
First we create a plot with default dataset and aesthetic mappings:
p <- ggplot(mpg, aes(displ, hwy))
p
p + geom_point()
geom_point() is a shortcut. Behind the scenes it calls the layer() function to create a new layer:
p + layer(
  mapping = NULL,
  data = NULL,
  geom = "point",
  stat = "identity",
  # geom_params and stat_params from older versions were replaced
  # by a single params = list() argument
  position = "identity"
)
* mapping: a set of aesthetic mappings created with aes(). If NULL, the plot's default mapping is used.
* data: a dataset that overrides the plot's default dataset. Usually NULL, so the default is used.
* geom: the name of the geometric object used to draw each observation.
* stat: the name of the statistical transformation to use. "identity" keeps the data as is.
* position: the method used to adjust overlapping objects, such as jittering, stacking or dodging.
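The arguments listed above can be combined in a single layer() call. The sketch below is an assumed example rather than anything from the original notes: it spells out a jittered, semi-transparent point layer, passing alpha through the params list. The name p_layer is introduced here.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy))

# Spelled-out equivalent of
# geom_point(aes(colour = class), position = "jitter", alpha = 0.5)
p_layer <- p + layer(
  mapping = aes(colour = class),
  data = NULL,
  geom = "point",
  stat = "identity",
  position = "jitter",
  params = list(alpha = 0.5)
)
p_layer
```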
The data on each layer doesn’t need to be the same.
Fit a loess model and generate predictions from it.
mod <- loess(hwy ~ displ, data = mpg)
grid <- data.frame(displ = seq(min(mpg$displ), max(mpg$displ), length.out = 50))
grid$hwy <- predict(mod, newdata = grid)
grid
## displ hwy
## 1 1.600000 33.09286
## 2 1.710204 32.16100
## 3 1.820408 31.26635
## 4 1.930612 30.41403
## 5 2.040816 29.60168
## 6 2.151020 28.82979
## 7 2.261224 28.09612
## 8 2.371429 27.39752
## 9 2.481633 26.73611
## 10 2.591837 26.11707
## 11 2.702041 25.52830
## 12 2.812245 24.99810
## 13 2.922449 24.53340
## 14 3.032653 24.09418
## 15 3.142857 23.64634
## 16 3.253061 23.21696
## 17 3.363265 22.80828
## 18 3.473469 22.40978
## 19 3.583673 22.01095
## 20 3.693878 21.60128
## 21 3.804082 21.17025
## 22 3.914286 20.70708
## 23 4.024490 20.19083
## 24 4.134694 19.63934
## 25 4.244898 19.08893
## 26 4.355102 18.57596
## 27 4.465306 18.13676
## 28 4.575510 17.80767
## 29 4.685714 17.58594
## 30 4.795918 17.40438
## 31 4.906122 17.26494
## 32 5.016327 17.17248
## 33 5.126531 17.13186
## 34 5.236735 17.14795
## 35 5.346939 17.22433
## 36 5.457143 17.35357
## 37 5.567347 17.53353
## 38 5.677551 17.76418
## 39 5.787755 18.04546
## 40 5.897959 18.37732
## 41 6.008163 18.75972
## 42 6.118367 19.19262
## 43 6.228571 19.67595
## 44 6.338776 20.20968
## 45 6.448980 20.79376
## 46 6.559184 21.42814
## 47 6.669388 22.11278
## 48 6.779592 22.84761
## 49 6.889796 23.63261
## 50 7.000000 24.46771
Isolate observations that are particularly far away from their predicted values.
std_resid <- resid(mod) / mod$s
outlier <- mpg[abs(std_resid) > 2, ]
outlier
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int>
## 1 chevrolet corvette 5.7 1999 8 manual(m6) r 16 26
## 2 pontiac grand prix 3.8 2008 6 auto(l4) f 18 28
## 3 pontiac grand prix 5.3 2008 8 auto(s4) f 16 25
## 4 volkswagen jetta 1.9 1999 4 manual(m5) f 33 44
## 5 volkswagen new beetle 1.9 1999 4 manual(m5) f 35 44
## 6 volkswagen new beetle 1.9 1999 4 auto(l4) f 29 41
## # ... with 2 more variables: fl <chr>, class <chr>
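Plot with additional datasets supplied to individual layers:
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_line(data = grid, colour = "blue", linewidth = 1.5) +
  geom_text(data = outlier, aes(label = model))
Name the argument explicitly (data =) in each layer, because the first argument of a geom is mapping, not data.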
Less clear method:
ggplot(mapping = aes(displ, hwy)) +
  geom_point(data = mpg) +
  geom_line(data = grid) +
  geom_text(data = outlier, aes(label = model))
aes() describes how variables are mapped to visual properties, or aesthetics. Never refer to a variable with $ (e.g. mpg$displ) in aes(): doing so breaks containment, so the plot no longer carries everything it needs.
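When a column name is only available as a string, one commonly used alternative to $ is the .data pronoun from rlang, which aes() understands. This is a minimal sketch; x_var, y_var and p_data are hypothetical names introduced for the example.

```r
library(ggplot2)

x_var <- "displ"
y_var <- "hwy"

# .data[[name]] looks the column up in the layer's own data
# when the plot is built, so the plot stays self-contained.
p_data <- ggplot(mpg, aes(.data[[x_var]], .data[[y_var]])) +
  geom_point()
p_data
```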
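Aesthetic mappings can be supplied in the ggplot() call, in individual layers, or in some combination of both. The following calls all produce the same plot: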
ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()
ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = class))
ggplot(mpg, aes(displ)) + geom_point(aes(y = hwy, colour = class))
ggplot(mpg) + geom_point(aes(displ, hwy, colour = class))
The distinction is important when you start adding additional layers.
ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme(legend.position = "none")
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(method = "lm", se = FALSE) +
  theme(legend.position = "none")
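The difference between the two plots can also be checked numerically by counting the groups in the smooth layer. The sketch below assumes layer_data() and adds an explicit formula = y ~ x only to silence the informational message; p_plot_level and p_layer_level are names introduced here.

```r
library(ggplot2)

p_plot_level <- ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_layer_level <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Count the groups in the smooth layer (layer 2) of each plot
length(unique(layer_data(p_plot_level, 2)$group))   # one fit per class
length(unique(layer_data(p_layer_level, 2)$group))  # a single overall fit
```

When colour is mapped at the plot level, the smooth layer inherits it and is split by class; when it is mapped only in geom_point(), the smooth layer sees no grouping variable.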
Generally, you want to set up the mappings to illuminate the structure underlying the graphic and minimize typing.
If you want appearance to be governed by a variable, put the specification inside aes(); if you want to override the default size or colour, put the value outside of aes().
ggplot(mpg, aes(cty, hwy)) + geom_point(colour = "darkblue")
ggplot(mpg, aes(cty, hwy)) + geom_point(aes(colour = "darkblue"))
The second plot maps (not sets) the colour to the value ‘darkblue’. This effectively creates a new variable containing only the value ‘darkblue’ and then scales it with a colour scale.
A third approach is to map the value, but override the default scale:
ggplot(mpg, aes(cty, hwy)) +
  geom_point(aes(colour = "darkblue")) +
  scale_colour_identity()
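The three approaches can be compared by looking at the colour that each built layer actually ends up with. This is an illustrative sketch that assumes layer_data(); the names p_set, p_map and p_identity are introduced here.

```r
library(ggplot2)

p_set <- ggplot(mpg, aes(cty, hwy)) + geom_point(colour = "darkblue")
p_map <- ggplot(mpg, aes(cty, hwy)) + geom_point(aes(colour = "darkblue"))
p_identity <- p_map + scale_colour_identity()

unique(layer_data(p_set)$colour)       # the literal value that was set
unique(layer_data(p_map)$colour)       # a colour chosen by the default colour scale
unique(layer_data(p_identity)$colour)  # the literal value, passed through unchanged
```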
If you want to display multiple layers with varying parameters, you can “name” each layer:
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(aes(colour = "loess"), method = "loess", se = FALSE) +
  geom_smooth(aes(colour = "lm"), method = "lm", se = FALSE) +
  labs(colour = "Method")
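A stat can be used in two ways: add a stat_() function and override the default geom, or add a geom_() function and override the default stat:
ggplot(mpg, aes(trans, cty)) +
  geom_point() +
  # fun replaces fun.y, which is deprecated since ggplot2 3.3.0
  stat_summary(geom = "point", fun = "mean", colour = "red", size = 4)
ggplot(mpg, aes(trans, cty)) +
  geom_point() +
  geom_point(stat = "summary", fun = "mean", colour = "red", size = 4)
The two calls produce the same plot. The first form is better.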
Use variables generated by stats:
ggplot(diamonds, aes(price)) + geom_histogram(binwidth = 500)
# after_stat(density) replaces the older ..density.. notation (ggplot2 >= 3.3.0)
ggplot(diamonds, aes(price)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 500)
ggplot(diamonds, aes(price, colour = cut)) +
  geom_freqpoly(binwidth = 500) +
  theme(legend.position = "none")
ggplot(diamonds, aes(price, colour = cut)) +
  geom_freqpoly(aes(y = after_stat(density)), binwidth = 500) +
  theme(legend.position = "none")
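The variables that a stat generates can be listed by inspecting the built layer. The sketch below assumes layer_data() and the same 500-unit bin width used above; p_hist and bins are names introduced for the example.

```r
library(ggplot2)

p_hist <- ggplot(diamonds, aes(price)) + geom_histogram(binwidth = 500)
bins <- layer_data(p_hist)

# stat_bin() adds computed variables such as count and density
head(bins[c("x", "count", "density")])

# density is scaled so that the bars integrate to 1
sum(bins$density * 500)
```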
Three position adjustments apply primarily to bars:
position_stack(): stack overlapping bars (or areas) on top of each other.
position_fill(): stack overlapping bars, scaling so the top is always at 1.
position_dodge(): place overlapping bars (or boxplots) side-by-side.
dplot <- ggplot(diamonds, aes(color, fill = cut)) +
  xlab(NULL) +
  ylab(NULL) +
  theme(legend.position = "none")
# position stack is the default for bars, so geom_bar()
# is equivalent to geom_bar(position = "stack").
dplot + geom_bar()
dplot + geom_bar(position = "fill")
dplot + geom_bar(position = "dodge")
dplot + geom_bar(position = "identity", alpha = 1 / 2, colour = "grey50")
ggplot(diamonds, aes(color, colour = cut)) +
  geom_line(aes(group = cut), stat = "count") +
  xlab(NULL) +
  ylab(NULL) +
  theme(legend.position = "none")
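The effect of the stack and fill adjustments on bar heights can be verified from the built layer data. This is a sketch under the assumption that layer_data() exposes the ymax of each bar segment; dplot_demo, stacked and filled are names introduced here so the dplot object above is left untouched.

```r
library(ggplot2)

dplot_demo <- ggplot(diamonds, aes(color, fill = cut))

stacked <- layer_data(dplot_demo + geom_bar())
filled <- layer_data(dplot_demo + geom_bar(position = "fill"))

# Under "stack", the tallest stack equals the largest count per colour
max(stacked$ymax) == max(table(diamonds$color))

# Under "fill", every stack is rescaled so that its top sits at 1
tapply(filled$ymax, as.integer(filled$x), max)
```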
There are three position adjustments that are primarily useful for points:
position_nudge(): move points by a fixed offset.
position_jitter(): add a little random noise to every position.
position_jitterdodge(): dodge points within groups, then add a little random noise.
ggplot(mpg, aes(displ, hwy)) + geom_point(position = "jitter")
ggplot(mpg, aes(displ, hwy)) +
  geom_point(position = position_jitter(width = 0.05, height = 0.5))
ggplot(mpg, aes(displ, hwy)) +
  geom_jitter(width = 0.05, height = 0.5)
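Because jittering is random, repeated renders of the same plot differ slightly. The sketch below shows one way to make the noise reproducible, assuming the seed argument of position_jitter(), and contrasts it with the fixed offset applied by position_nudge(). The names p_jit and p_nudge are introduced here.

```r
library(ggplot2)

# A fixed seed makes the jitter reproducible across renders
p_jit <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(position = position_jitter(width = 0.05, height = 0.5, seed = 1))

# Building the plot twice gives the same jittered positions
identical(layer_data(p_jit)$x, layer_data(p_jit)$x)

# position_nudge() shifts every point by the same fixed offset instead
p_nudge <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(position = position_nudge(x = 0.1))
```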